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Abstract 

Many real-world problems encountered in several disciplines deal with the modeling of time-series 
containing different underlying dynamical regimes, for which probabilistic approaches are very often 
employed. In this paper we describe several such approaches in the common framework of graphical 
models. We give a unified overview of models previously introduced in the literature, which is simpler and 
more comprehensive than previous descriptions and enables us to highlight commonalities and differences 
among models that were not observed in the past. In addition, we present several new models and 
inference routines, which are naturally derived within this unified viewpoint. 



1 Introduction 

Several problems encountered in application areas such as finance, biology, speech analysis, control 
engineering, robotics, etc. require the modeling of time-series containing switching among different 
dynamics regimes (see Ephraim (20021 for a review). For example, system fault diagnosis deals with 
detecting behavioural deviations from normality originated by failures in the system. 

Such a modeling is often achieved by employing probabilistic approaches in which regime switching 
is described by a set of discrete hidden random variables, related by a first-order Markovian dependence. 
All such models, that we call hidden Markov switching models (HMSMs), can be viewed as extensions 
of the popular hidden Markov model |Rabiner] ( |1989[ ) . 

The wide interdisciplinary attention to this research area has produced many different HMSMs as well 
as different approaches and implementations of HMSMs of fundamentally similar structure, resulting in a 
dense literature from which extracting differences and commonalities among models is often challenging. 

In this paper we provide a simple unified treatment of existing HMSMs, highlighting properties and 
connections that were not observed in previous review papers Ephraim (20021; Gales and Young (19931; 
|Murphy| ( |2002[ ); |Ostendorf et al.| ( |1996| >; |Rabiner| ( |1989[ ); |Yu| fl2010[ ), and introduce novel extensions. 
Our exposition enables a deep understanding of the fundamental structure and relations of different 
approaches. This is achieved by using the framework of graphical models, which allows to easily define 
complex models by using a graphical representation and to derive efficient inference routines by visual 
inspection of the graph, avoiding complex algebraic manipulations. 



2 Independence in Belief Networks 



One of the fundamental tasks in probabilistic time-series modeling is the development efficient inference 
routines, which requires exploiting statistical independence relations among random variable ^*) By using 
the framework of graphical models and, in particular, of belief networks |Barber| ( |2011[ ); 



Koller and 



Friedman ( 2009 1; Pearl ( 1988 1, statistical independence can be assessed by visual inspection of the graph, 
and therefore inference can be achieved without the need of algebraic manipulations. In this section, 
we introduce some basic definitions of the graphical model theory and explain two equivalent methods 
for assessing statistical independence. These methods form the basis for the derivations of the inference 
routines presented in the rest of the paper. 



Due to space limitations, parameter learning is not discussed in this exposition. This problem however does not normally 
pose particular challenges once inference has been achieved. 
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Figure 1: (a) Example of directed acyclic graph. The node X3 is a collider in the path 22, £3, xi, but it is a non-collider in the 
path X2, £4, £3, x\. Adding a link from £4 to x\ would result in a cyclic graph, (b) Belief network representation of a hidden 
Markov model. Rectangular nodes indicate discrete variables, whilst oval nodes indicate discrete or continuous variables. Filled 
nodes indicate that the variables are observed. 



Basic Graphical Model Definitions 

A graphical model is a graph in which nodes represent random variables and (undirected or directed) 
links represent statistical dependencies among variables. 

A path from node Xi to node Xj is a sequence of linked nodes connecting Xi to Xj . 

A node Xi with a direct link towards Xj is called parent of Xj . In this case, Xj is called child of Xi . 

An undirected path from Xi to Xj has a collider at Xk if there are two arrows along the path pointing 
towards Xk ■ Notice that a node can be a collider in a path, but a non-collider in another path. For exam- 
ple, in Fig. [l](a) X3 is a collider in the path X2,X3,xi, but it is a non-collider in the path X2,X4,,x$,x\. 

A directed acyclic graph is a graph with no directed paths starting and ending at the same node. For 
example, the directed graph in Fig. [I] (a) is acyclic. Adding a link from 24 to x\ would make the graph 
cyclic. 

A belief network is a directed acyclic graph in which each node x, has associated the conditional distri- 
bution p(xi\parents(xi)) . The joint distribution of all nodes in the graph is given by the product of all 
conditional distributions 

D 

p(xi;D = x\,... ,x D ) = Y[p{ x i\P aren t s { x i))- 

1=1 

A node Xi is called ancestor of a node Xj if there exists a directed path from Xi to Xj . In this case, Xj 
is called descendant of Xi. 

Assessing Statistical Independence 

Method I. Given the sets of random variables X, y and Z, X and y are statistically independent given 
Z (denoted with X _LL y \ Z) if all paths from any element of X to any element of y are blocked. A 
path is blocked if at least one of the following conditions is satisfied 

1. There is a non-collider in the path which is in the conditioning set Z. 

2. There is a collider in the path such that nor the collider nor any of its descendants is in Z. 

Method II. This method consists of converting the directed graph into an undirected one and then 
using the rules of independence for undirected graphs. This is achieved by the following steps: 

1. Create the ancestral graph: remove any node which is neither in X U y U Z nor an ancestor of a 
node in this set, together with any links in or out of such nodes. 

2. Perform moralisation: add a link between any two nodes which have a common child. Remove 
arrowheads. 

3. Use independence rules for undirected graphs: if all paths which join a node in X to one in y pass 
through any member of Z then X _LL y | Z. 
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3 Hidden Markov Model 



We start our analysis with the Hidden Markov Model (HMM), which represents the basic model from 
which extensions will be derived. The HMM has a belief network representation given in Fig. [T] (b), in 
which the discrete or continuous (possibly multivariate) variable v t and the discrete variable St represent 
respectively the observations and the underlying dynamics regime at time t. The joint distribution of all 
variables factorises af] 

T 

p{Vl:T, Si-.t) = p{vi | Si )p(si ) ]^[ p{v t | S t )p(s t | Si- 1 ) • 

t=2 

We assume a homogeneous regime switching, parameterized by a vector 7r and a matrix n such that 
p(s\) — n 31 and p(st\st-i) = n stSt _ 1 for t > 1. For the case in which vt is a continuous variable, the 
distribution p(vt\st) is commonly modelled as a Gaussian mixture. 

There are three main properties that can limit the modelling accuracy of HMMs, namely 

1. The strong independence assumption among observations vt -LL W\t=vi,...,« 4 _ 1 ,t> t+ i,...,u T | s t , with 
consequent lack of smoothness in modeling the observations. 

2. The weak modelling of the state-duration distribution: the probability of observing state i for d 
consecutive time-steps is implicitly given by the geometric distribution — no), which is often 
inappropriate and encouraging too fast regime switching. 

3. The limited power in modeling the observations when high noise is present. In particular, the 
HMMis unable to provide an estimate of the dynamics underlying noisy observations. 

In the next sections we describe extensions of the HMM that overcome these limitations. 



4 Extension of HMM to account for Dependency among Observations 



A simple way of relaxing the strong independence assumption vt -LL v\ t \ s t in the HMM is to introduce 
Markovian dependence among the observations. In the belief network, this is represented by adding links 
from past to current observations, as shown in Fig. [2] (a) for the case of first-order dependence. For 



fc-order dependence, the joint distribution can be written 



P(VV.T, SI-t) = Y\p( V t\st,V t -k:t-l)p(St\st-l)- 

t=l 

Models of this type include, for example, the Switching Autoregressive Model (SARM ) in whict ^l 
f(s t ,vt-k-.t- i) +% = a T v t-i + Vt (Vt being a Ga ussian or Gamm a noise vector) 

1990 19931, and extensions to a nonlinear interaction / Susmel (20001. 



Hamilton 



(1989 



In the sequel, we describe efficient routines for solving the two most common inference problems in 
these models, namely the computation of p(st\vi-.T), known as smoothing, and the estimation of the most 
likely sequence of states s*. T = argmaxp(si:T|fi : T). 



Smoothing with Parallel Routines 

The estimation of 7| 4 = p(st\vi-.T) can be obtained a^] 



7t* OC p(s t ,VX:T) = p(v t+ l:T\st,Vl t t^lf,Vt-k+l:t) p(s t ,V 1]t ), 

where we have used the notation p(vt+i:T\st,vift=^k~, vt-k+i-.t) to emphasise the independence relation 
v t+ i;T -LL ui:t_fc j {s t ,v t -k+i:t}- This relation can be assessed using any of the two methods explained in 
Section[2] For example, for k = 1 we have vt+i-.T -LL Vut—i j {st, v t }, which follows from 

2 We use x\-T as a shorthand for x\,. . . ,xt- 
3 We use the convention x T = for r < 0. 

4 The HMM can be obtain as a special case when k = 0, with the convention x t i. t n = 0, for t' > t". 
5 The distribution for vi-k has to be defined separately. 
6 The normalising constant p(vi-t) is estimated as J2 St a T • 
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Figure 2: (a) Extension of the HMM to account for first-order Markovian dependence among the observations, (b) Ancestral 
moralized graph of the belief network in (a). 



Method I: All paths from (any element of) t>i : t_i to (any element of) Vt+i-.r are blocked, as v t +i-T are 
reached by passing (1) from both St and v t , (2) from s t only, (3) from vt only (Fig. [2ja)). In cases 
(1) and (2), s t is a non-collider in the path which is in the conditioning set, and therefore condition 
1. is satisfied. In case (3), Vt is a non-collider in the path which is in the conditioning set, and 
therefore condition 1. is satisfied. 

Method II: The ancestral moralised graph is depicted in Fig. [2jb) . All paths from vi-.t-i to vt+i-.T 
have to pass through an element of the conditioning set s t or v t . 

The terms af and PI * can be recursively computed as follow^ 
af = p{v t \s t ,Vx&-te^r, v t -k:t-i)p(st,v 1:t -i) 

= p(Vt\s t ,Vt-k:t-l) ^2 p(s t \s t -l,Vl^f)p(st-l,Vl:t-l) 

= p(v t \s t ,V t -k:t-l) ^2 TTs^^ja^ 1 , 
st-1 

PV = ^2 P(vt + 1:T\?C Si + l,« t _ fc+ l: t )p(s t + l|s t ,U i _ W rr) 

s t + l 

= P( Vt + 2 -- T \ St + 1 > V J^*Fr,t-k + 2:t + l)p(vt + l\St + l,V t - k + 1 .. t )TT St + iat 

s t + l 

= 22 ^t+i 1 P(v*+i|st+i,Wt-fc+i:t)vr St+lst . 

st + l 

For k = 1 for example, the independence relation s t AL Vi-.t-i j s t -i can be demonstrated by noticing that 
all paths from vi-t-i to st reach st from (1) the non-collider s t -i which is in the conditioning set, (2) the 
collider v t which (together with all its descendants) is not in the conditioning set, (3) s t +i which imposes 
passing through a collider (e.g. v t +i) which (together with all its descendants) is not in the conditioning 
set. 

Notice that the a^* and fil* recursions can be performed in parallel, after which 7 t 3t is computed. 
The computational cost of these recursions is 0(TS 2 ) (without considering the cost of estimating terms 
such as p(vt\st, Vt-k-.t-i))- In order to avoid numerical overflow/underflow problems, computations are 
often preformed in log scale. 



Smoothing with Sequential Routines An alternative way of performing smoothing is to first 
compute the filtered distribution a 3 * = p(s t \v 1:t ) and then use this estimate to compute = p(st\vi-.T) 
as follows 

7 The initialisation is given by qJ 1 = p(vi\si)tt s1 and f3^f = 1. 
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st , , w , . lP = ^P{st\st+i,v 1 , t ,v M ^r)p(st + i\v 1 , T ) 

<H <XP(Vt\St,liL^k^r,Vt-k:t-l)p(St\vi:t-l) ^ 

= p(«t|Si,«t-fc : t-l) p(«t|st-lU&*«^P(*t-lKt-l) = V + ! K^P( S 'KO >t+l 

n-i Ef t p(st+i|st>JJ*iOp( s «l' u i-0 4+1 

= p(u t |«t,«t-A.:t-l) 5^T.,» t -i«Ji'i 1 , = 7T3 (+l3t ar „ t+1 

s t+1 Z^s t ^st+ist"* 

where the normalisation p(ut|«i:t-i) in the a** recursion is obtained by summing the rhs of the above 
over all values of St. Notice that we are not making use of the observations after the a^' have been 
computed. These routines do not require working in a log scale. 

In the following sections, we will focus on the parallel smoothing approach and come back a sequential 
smoothing approach in Section [7] 

Computing the Most Likely Sequence of Hidden Variables 

If we define <5f* = max si . t _ 1 p(si:t, Vut), the most likely sequence of hidden variables s*. T = argmax Si p(si : t\vi : t) 
can be obtained with the following algorithm 

^i 1 =p(s 1 ,v 1 ) = al 1 

for t = 2,...,T = p(vt\st,vt-k:t-i)ma,xTY atat _ 1 S s t t _~ 1 1 , i/j'* = argmax7r 3t . !t _ 1 <5 t s l~ L 1 

"t-l s t -l 

St = arg max S^F , for t = T — 1, . . . , 1 s* = ipt^ 1 
where the recursion for <5j f is obtained as the a*' recursion with the sum replaced by the max operator. 
Artificial Data Example 

In this section, we illustrate how the switching autoregressive model (SARM) can overcomes the limitation 
of the HMM in modeling the temporal structure of a continuous time-series. In this continuous case, the 
observations in the HMM are commonly modeled as mixture of M Gaussians, for which the smoothed 
distributions are given by 7 t Sf,mt = p(v t +i:T\st,p*f,Jte*t)p(s t , m t , vi :t ) with 

<V' =p{St,m t ,Vi : t) 

= p{v t \s t ,m t ,vi i t^)p(m t \s t ,V2^T) ^ p(s t \s t -i,Vi<t^r)p(st-i, v\-t-i) 

= p{fit\st, m t )p(mt\st) ^ p(st\s t -x) }^ p(s t -i,mt-i\vi: t -i), 
which has computational cost 0(TS(M + SM)), and 

Pt* = p(vt+l: T \st) 

= ^ P(. v t+2--T[?Cst+i,ju*r) p{vt+i[?{,st+i,mt + i)p{mt + i\X,st+i)p{st+i\st) 

St+1 mt + 1 

which has computational cost 0(TS 2 M). 

We generated a 200 time-step-long time-series, corresponding to a noisy sinusoid with switching 
frequency at t = 100. The time-series with the correct segmentation represented by different colours is 
plotted on the top of Fig. [3] 

We then learned the parameters of a HMM with observations modeled as mixtures of M = 3 Gaus- 
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200 



Figure 3: Top panel: the generated time-series up to the first 20 regime switches. Bottom panel: the sequence of (red) underlying 
and (gray) estimated regimes as arg max S( 5Zc t Tt* ^' ie intensity of the gray indicates the value max st ^2 ct 7t*' c * (darker 
colour means greater value). 



sians, using an EM algorithm where the M-step updates are given by 

Ef=i v t p(s t ,rn t \v 1:T , 9 1 " 1 ) 



ELlP( S t I m i|«l:T,0 4 " 1 ) 

EtliW ~ M»t,m t )(wt - Mst,m t ) T p(st,mi|wi :T ,0'" 1 ) 



J2^iP( s t,mt\v 1:T 

p(si) = p(si\VUT, 1_i ),p(s t |Si_i) 



p(m t |s t ) 



E t = 2 E 3t p(^-i:*ki:T,e-i) 



E t =iE mt p(s t ,m t |« 1;T ,e-i) 



The resulting segmentation is given in the middle of Fig. [3] As we can see, the HMM splits the time-series 
into two regimes, one corresponding to higher values and the other one corresponding to lower values 
of the time-series. It is clear from this example, that in order to obtain the desired segmentation more 
temporal structure needs to be encoded into the model. One solution would be to add a link from m t -i 
to rn t , which would give an a routine 

= p{vt\s t ,m t ,]rut^r) ^2 p(m t -i;t,st,vi:t-i) 

"H_i 

= p(v t \s t ,m t ) ^2 p(m t \m t -i,st,V2^r) ^ p(m t -x, St-i-.t, Vi-.t-i) 
mt-i s t -i 

= p(vt\s t ,m t ) ^2 p(mt\m t -i,st) ^ p(s t \s t -i,rjit^,jii*&=r)p(st-i,rnt-i,Vi : t-i) 

mt-i a t _ 1 

= p(v t \s t ,m t ) ^2 p(m t \mt-i,s t ) p(s t \st-i)a% 



s t _i,m t _i 
H-l i 



which has computational cost 0(T S M (S + M)) . This therefore would increase computational complexity 
considerably for complex time-series in which M needs to be high. A less expensive alternative is to use 
a SARM, which. After learning the parameters similarly to the HMM case, SARMgives the desired 
segmentation as shown in the bottom of Fig. [3] 

5 Extension of the HMM to Model Explicitly the Regime-Duration Distri- 
bution 

A way of overcoming the limitation of the implicit geometric regime-duration distribution of HMMs, is 
to define a different distribution by introducing extra random variables, at the price of increasing the 
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Figure 4: HMSMs in which the regime-duration distribution is explicitly modelled using a set of discrete variables c\ : x (a) as 
in Section |5.1.1| and (b) as in Section |5.1.2| In (b) the visible dependence across regimes (from time t to time t + 1) is cut (red 
dashed lines) (ct+i = 1 indicates the start of a new regime at time t + 1). 



computational cost of inference. This idea was first presented in Ferguson ( 1980 1 and largely followed in 



the speech Murphy (20021; Ostendorf et al. (19961; Rabiner (19891 and statistics communities Guedon 



120031; Sansom and Thomson| ( |2001| ). There are two main ways to achieved this, which differ by using 
one or two sets of discrete hidden variables respectively. In the sequel we introduce them and discuss 
their relative properties. 

In the exposition, we will pay particular attention to the calculation of the effective computational 
cost of the inference routines, with the goal of clarifying incorrectness and misunderstanding of the 
current literature. 



5.1 Modelling the Regime-Duration Distribution with One Set of Duration-Count 
Variables 

When using one set of discrete hidden variables ci.t for defining an explicit regime-duration distribution, 
we can think of two different approaches for modeling such variables. In the first approach, c t provides 
information about the time-step r > t at which the current regime ends, which gives rise to decreasing 
duration-count variables within a regime. In the second approach, c t provides information about the 
time-step r < t at which the current starts, which gives rise to increasing duration-count variables within 
a regime. These two approaches and their properties are discussed in detail in the sequel. 



5.1.1 Decreasing Duration-Count Variables within a Regime 

In the first approach to explicit modeling of the regime-duration distribution using decreasing duration- 
count variables within a regime, at a beginning of a regime t, c t indicates the number of time-steps 
(duration) spanned by the regime sampled according to a regime-duration distribution. At subsequent 
time-steps, the duration-count variables take progressively smaller values, until reaching value 1 at the 
end of the regime. More specifically, using the notation at = {s t ,c t }, we define p{o\ : t) ~ Ylt p(ft|f t — l) = 
ll t p(st\st-i, c t _i)p(c t Jc t _i) with 

^s t s t _ 1 if ct_i = 1, , , , J Pc t if ct_i = 1, 

d st = St _ 1 llCt-l>l, (^Oc^ct.j-l llCt-l>l, 

where S is the Dirac delta and p specifies the state-duration distribution. By considering a distribution 
that is zero outside the interval {d m i n , • . . , dmax}, we can impose constraints on the minimum and (if 
TTa — 0) maximum duration allowed for a regime. We consider the case of fc-order Markovian dependence 
among the observations, and therefore obtain a belief network representation of the model as in Fig. [4] 
(a) (for the case k — 1). 



(St\St-l,Ct-l) 
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Smoothing with parallel Routines 

To estimate 7 t CTf = p(a t \vi-.T) oc p(vt+i-.T\crt, Vt-k+i:t)p(<Tt,Vi:t) = P^&T we can use a similar approach 
to the one described in Section H] SpecificalljQ 

«t* = p(v t \St,fif,Vl^-- ft^T, V t -k:t-\)p{&t, Ul :t _l) 

= p(«t|st,«t-fc:t-i) 2^ p(o"tki-i,Jittt=T)a^ 1 

= p(ui|s t ,u t _fe:i_i)|(5 ct < dmax ati'i t+:L +p ct ^^tst-^t'Si' 1 ^, 

"t-i 

where, by exploiting the fact that Ct < dmax implies that either c t _i = c t + l, s t -i = s t or c t -i = 1, and the 
fact that ct = dmax implies that c t _i = 1, we reduced the cos f]from 0{TS 2 d^ x ) to 0(TS 2 d m ^). Further 
saving may be obtained by pre-computing X) s _ 7T atSt _ 1 a s t t J~i' 1 , giving a complexity of 0(TS(S + dmax))- 
This cost is however more expensive than the cost 0(TS 2 ) of the model in Section[4] 
The term can be obtained as follow^] 

fit* = P(^ + l:T|X,^i + l,"t-fc + l:Op( CT i + lki.Jt-*-f=rrr) 

= ^2 P( v t + 2.T\^t + l,Vj^--kri, v t-k+2:t+l)p{vt + l\St + l,fy^rC, Vt-k+l:t)p(&t+l\<Tt) 

"max 

= S ct=1^2p( v t+l\St+l, Vt-k+l:t)n et+1 , t ^ Vc t+ i +<5 Ct >lP(«t+l|st+l, Ut-fc+l:t)/3t+i t_1 , 

+ l c t + 1 =d min 

where, by using the fact that ct = 1 implies change of regime at time t + 1, whilst c t > 1 implies 
Ct+l = Ct — 1, s t+ i = St, we have reduced the cost to 0(rS'd max ). 

Most Likely Sequence of Hidden Variables 

With the notation 5%* = max CT1 . t _ 1 p(ai :t , vi-t), the most likely sequence a{. T = arg max (Ti p(<ti : t|vi : t) 
can be obtained as 

81 1 = p(a 1 ,v 1 ) = al 1 

p{v t \s t ,vt-k:t-i) max[<5 t s i^ f+1 , p ct max St _ 1 ir stSt _ 1 5" i r 1 1 ' ] if c t < d 



at 



5° 



W = 



max 

St-l ,1 



p(vt\s t ,vt-k:t-i)pc t max Si _j irstat-i^t-i' otherwise 

{argmax S( i ■Ka u a%-i^i' > 1} if /°c t maXs^ n^s^a^ 1 ' > <5jl'i* +1 or c t = d max 
{st,Ci + l} otherwise 



crj = arg max <5J T , for t = T — 1, . . . , 1 crj* = V't+l 1 ■ 
Artificial Data Example 

There is considerable evidence in the literature that extensions of the HMM to explicitly model the 
regime-duration distribution give improvement in performance in several real-world problems such as 
handwriting and activity recognition, speech, MRI, financial time-series, rainfall time-series, and DNA 



analysis. Some specific examples can be found in Yu (20101. In this section, we give a simple illustration 



of this property for the switching autoregressive model (SARM), using artificial data. 



8 The initialisation is given by a" 1 = p(?;i|si)7r 31 p(duration > ci) = p(vi |si)7r si y"]^" 1 "* Pi- Notice that this way we are not 
imposing that the first regime starts at the first time-step 4 = 1. 
9 We do not consider the cost of computing p(v t \s t , v t — k-.t—i)- 
10 The initialisation is given by /3J, T,CT = 1. 
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Figure 5: The generated time-se ries up to the first 20 regime switche s, indicated by different colours. 




Figure 6: The empirical (orange), the GSARM (blue), and USARM (green) regime-duration distributions for regime 2. 

We generated a time-series from a 3-regime, 3-order SARM vt = X^i=i a \ tv t-i + r\t with 
a = (1.8,-0.99,0), a = (1.65,-0.9,0.1), a 3 = (1.8, -0.85, 0); r\ t ~ A/"(0, 1). 



The time-series contains 100 regime switches of type and duration uniformly sampled between 1 and 3 
and between d m i n = 30 and d max = 50 respectively, giving a total length of T = 4004. Notice that this 
problem is quite hard, since the autoregressive coefficients for the three regimes are similar. In Fig. [5] 
we show the generated time-series up to the first 20 regime switches, indicated by different colours. 

As a measure of segmentation error, we used the discrepancy between the correct and the estimated 
sequence of regimes (obtained as argmax Jt ~l2 ct 7t' t ' ct f° r the case of smoothing). For the SARM with 
geometric regime-duration distribution (GSARM), we used the maximum likelihood value of the regime- 
switching distribution n on the same sequence computed assuming that the other parameters and the 
segmentation are known. For the SARM with explicit regime-duration distribution (USARM), we used 
a uniform regime-duration distribution in the interval {30,...,50}, and 7r« = = 1/(S — 7^ i. 

In Fig. [6] we plot, for regime 2, the empirical regime-duration distribution of the time-series (orange) 
and the geometric (blue) and uniform (green) distributions used in the GSARM and USARM. 

We have performed two experiments. In the first, the autoregressive coefficients and the Gaussian 
noise variance were assumed to be known, and therefore we measure difference in performance due to 
inference only. In this case, the GSARM gave a segmentation error of 0.16% for the case in which 
smoothing was employed and of 0.31% for the case in which the most likely sequence of regimes was 
estimated, whilst the USARM gave a smaller error of 0.07% and 0.08%. In the bottom panel of Fig. 
|7]we show the sequence of (red) underlying and (gray) estimated regimes as argmax st ^2 ct 7t St ' Cf ■ The 
intensity of the gray indicates the value max st 7 t Sf ' c * (darker colour indicates greater value). From 
this figure, we can observe that the GSARM is in general more uncertain about the regimes than the 
USARM, particularly for regime 2 to which portions of time-series belonging to other regimes are often 
assigned. 

In the second experiment, the autoregressive coefficients and the Gaussian noise variance were assumed 
to be unknown, giving rise to a more difficult task. These parameters were estimated by the EM 
algorithm, using as initialisation 

a 1 = (0.8,-0.99,0), a 2 = (-0.65, 0.2, 0.1), a 3 = (0.9, -0.35, -0.3); r/t ~ Af(0, 100). 
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Figure 7: Top panel: the generated time-series up to the first 20 regime switches. Bottom panel: the sequence of (red) underlying 



and (gray) estimated regimes as argmax S( 
colour means greater value). 



It 



The intensity of the gray indicates the value max st 7t* (darker 



In this case, the GSARM gave a segmentation error of 0.15% for the case in which smoothing was 
employed and of 0.28% for the case in which the most likely sequence of regimes was estimated, whilst 
the USARM gave a smaller error of 0.07% and 0.07%. 

5.1.2 Increasing Duration- Count Variables within a Regime 

An alternative model for the duration-count variables ci : t to the one described in Section [5. 1.1| would 
be to set c t = 1 at a beginning of a regime t, and define progressively increasing duration-count variables 
at subsequent time-steps within the regime. Specifically 




p(st\s t -i,c t ) = { "* tSt 1 liCt ]' p(c t \c t -i)-- 

t =s t -i ltCt>l, 

The relation between this model for cx-t and the one in Section [5. 1 . 1 1 is given by 

1 - A Ct = p Ct I }] pi. 

i=<H 

This can be seen as follows 



if C t = C t _! + 1, 
if c t = 1, 
otherwise, 



Pd=p 



(duration = d) = (1 — A<j) \d- 



, A 2 Ai = (1 - A<j) ^2 p% ■ 



p(duration>d) 



This alternative model for c\.t is useful to derive efficient inference routines, for example, for the case 
of Markovian dependence among observations in which we want to define independence across regimes 
(these types of models are often called change-point models). This can be achieved by adding a link 
from c t to v t in the belief network (see Fig. [3] (b) for the case of order k = 1). Indeed, in this case the 
variable c t encodes information about the time-step at which the current regime started and therefore 
about when to cut past dependence, i.e. p(vt\vt*-krt=cf, Vt- ct +l:t-l, «t, Ct), k > c t . This would not be 
possible with the model of Section |5.1.1| in which at time t information about when the current regime 
started is not available. Notice that, in this across-regimes independence case, it makes sense to have 
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Ct-1 =2 ct = 1 c,+l > 1 



ct-i = 2 ct = 1 Ci+i > 1 





Figure 8: HMSMs in which the regime-duration distribution is explicitly modelled using two sets of discrete variables c 1: y, d\ 
as in Section |5.2| In (a) the visible dependence from time 1, . . . ,t to time t + 1, . . . , T is cut (red dashed lines), since ct = 
indicates the end of the regime at time t. 



tth 7^ 0. The a recursion is given bjF^l 

a t* = P(«t|ct,J£W— ft=T, V t -k:t-l)p(&t, Vl:t-l) 

= p(v t \cTt,Vt-k:t-l) ^2 p{vt\<Jt-l,V3^r)att~ 1 1 
"t-l 

= p{Vt\(T U Vt-k:t-l)\5 ct> lXct-ial-l _1 +<5c { =l ^ (l-Act.J ^ Tst^-iQifll 1 }, 

Ct-l=d m in s t-l 

where p(ut|crt, = p(-u t |<7t, v t - ct +i:t-i) if fc> c t . The computational cost of this recursion is given 

by 0(TSdmax), namely the cost of the /3 recursion in the previous model. 
The P recursion is obtained as follow^] 

P? = P( V t+l:r\p{,Crt+l,Vt-k+l:t)p{^t+l\(Tt,Vi. 



= ^2 p(vt+2:T\a t +l, Vt~&fl, Vt-k+2:t+l)p(v t +l\<Tt+l,V t -k+l:t)p(crt+l\<Tt) 
CT t+l s , 



+ ^c t <d max A ct p(ut+i|st, Ct+1 = C t + 1, Wt-fc+l:t)^t4-'l* 

When pre-computing ~}2 St p(v t +i\s t +i, c t+1 — 1, Vt-k+i:t)^s t+1 stPt+i' > the computational cost of this 
recursion is given by 0(TS(S + d max )), namely the cost of the a recursion in the previous model. 

5.2 Modelling the Regime-Duration Distribution with Two Sets of Duration-Count 
Variables 

In this section, we describe how to model the regime-duration distribution using two separate sets of dis- 
crete variables d\-T and ci-t- The duration variable d t specifies the number of time-steps spanned by the 
observations forming the current regime, and takes a value sampled from a regime-duration distribution. 
The count variable Ct is modelled similarly to Section [5. 1.1 1 with the only difference that, at the beginning 
of a regime t, Ct = d t instead of being sampled from a regime-duration distribution. More specifically, 
by using the notation a t = {s t ,d t , c t }, we define p{a t \a t -i) = p(ct\d t , c t -i)p(d t \d t -i, c t _i)p(s t |s t _i, Ct-i) 
witfE] 
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1 The initialisation is given by a^ 1 = p(vi|si)7r 31 p(duration > ci) = p(vi\si)n si Yi^Li 1 Pi- Notice that this way we are no 
imposing that the first regime starts at the first time-step t = 1. 

12 The initialisation is given by /3^ T = 1. 
13t 



3 For t = 1, p(si) = w B1 ,p(d 1 ) = p dl ,p(ci\d 1 ) = 8 ci=dl . 
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, , Ipdt lfCi_l=l, J5c t =d t lfc t _i = l, 

p(dt\d t -x,ct-i) = i p(c t \d t ,ct-i) = i 

" Od t =d 4 _i lIC t _i>l, lOcj^-i-l lICt-l>l, 



and p(s t St-i, Ct-i) as in Section 



5.1.1 



The variables dt and ct provide information about the time-steps in the past and future at which the 
current regime starts and ends. Therefore, unlike the previous models, this model can be used in the 
case of a non-Markovian visible dependence within a regime. This is expressed by the undirected links 
among the observations in the graphical model of Fig. [8] (a). Notice that, both visible independence 
across regimes (Fig. [8] (a)) or Markovian dependence across regimes (Fig. [8](b)) can be assumed. In the 
first case, these models are usually called segmental HMMs. 

In the sequel, we provide inference routines for the case in which independence is assumed. 



Smoothing 



In order to perform smoothing, we first compute terms of the type 7t t,ct ' = p(st,d t ,c t = 1|ui : t) for 
which the count variable takes value 1. If we define a\ — {s t , d t , c t — 1}, we obtaii 



Isl 



It 



at,d t ,l 



OC p(vt+l-.T\St,C t = l,sUrW£t) P(<Tt ,Vl:t) 



The term a^' 111 ' 1 is estimated as followj^j 

Q st.dt,l = p(y t -dt+l:tWt,S>^*^)p(. a t,'"l^-dt) 

, | In V* / 1| 1 » s t-d t ' d t-d t ,1 

= P{Vt-dt+l:t\<Tt) 2^ P( CT tm-d t ,vx^sT)a t -d t 

s t-d t < d t-d t 

I i 1\ V* S t-d t > d t-d t .1 

= p{Vt-dt+\:t\<Tt)pd t K'ft-dt 1^ a t-dt 

s t~d t d t-d f 



- s t — df> 
t-d t 



Whilst the complexity would appear to be 0(TS ,2 (d max — d m in + l) 2 ), by pre-computing a t l^ *' and 
®t-d t ~ 5^s f d ' Ks t s t-d t ^t-dt' 1 ' we obtain a complexity of 0(T5(S' + d max — d m i n + 1)) (without consid- 
ering the complexity required to compute q{vt-d t +i-.t\o~ t )). Notice that this complexity is smaller than 



14 In this model it makes sense to have na ^ even when a minimum regime duration is imposed. 

15 The normalisation term p(fi ; r) can be computed by summing Eq. ^over St for a specific t, or as ^Zs T ^ T o^p T ' dT ' 1 when 
imposing the contraint cy = 1. 

16 The initial and final recursions are as follows 

| p(yi:t\<rt)n*tPdt if d t = t 

for t = 1, e£ m j n a**' = \ if dt > t & constraints cq = 1, d m i n = 1 

p(vi:t\<Tt )^" t Pd t if dt > t & no constraints co = l,d m i n = 1 

p(vi:tWl)ftstPdt H d t = t 
, . if (it > t & constraints cq = 1 , d m i n = 1 

for t = d m i n + 1, . . . , cZ max a t = < p(i; 1:j; |<7jl )7fs t Pdi if fit > i & no constraints Co = 1, d m in = 1 

p(vt- dt+1 ..tWi)pd t T. 3t _ dt K'tH-dt E dt _ dt ' t_dt ' if d < < * 

for t = c? max + 1, ■ • • ,T a t tAul =p{v t -d t +i:tWl)Pd t Yl n "tH-dt Yl a t-dt*' *~ dt ' 

B t~d t d t-d t 

for t = T + 1, . . . ,T+ d m ax - 1, for d t = max(d min , t - T + 1), . . . , d max 

s t ,d t ,l / I Is s t-d t > d t-d t .1 

<V =P(«t-d t +l:T|0- t )pd t 2^ 7r »t«t-d t 2^ Q i-d f 

s t-d t d *-d t 

and pip 1 = 1 . Under the constraint cq = l,ct = 1 with d m i n > 1 , extra care to ensure consistency at the beginning and 

end of the sequence is necessary. The case co ^ 1,ct ^ 1 requires the definition of additional distribution for partial-regime 
observations. 
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Q(rS 2 dLx) and Q{TS 2 d m ^) (c ase d min = 1) stated in |Mitchell et al.| ( |1995| ); |Murphy| ( |2002| |; |Rabiner 
( |1989[ ); |Yu and Kobayashi] ( |2003[ ) for the same approach 



With the notation cr t _j_ fe = {st+k, d t +k = Ct+fc = 1}, the term ft 4 ' 1 can be estimated with the 
following recursion 

ft 8 *' 1 = P( v W--r\cr^ k , s t , c t = l)p{a^ k \s t , c t = l) 

which, by pre-computing p(«t+i:i+fc|<7- t 1 | fe fc )/3 s ^ fc ' 1 pfc, has complexity 0(TS(S + d max - d min + 1)). There- 
fore, the a — f3 routines with separate duration-count variables are computationally convenient over 
common duration-count variables when d max — dmin + 1 "C d max . 
Distributions of interest such as p(s t , Ct\vi-.T) can be estimated as 

<*max 

p(s t , c t \v 1[T ) oc ^ Pt+l t -i a t+t-i 

d— max(d m j n ,Cf ) 

p(s t \v 1:T )K £ ffV*'"' 1 . (i) 

t — t d— max(d m i n ,r — 

The computational cost for p(s t , ct|vi : r) is given by 0(TSd m i n (d 

max t^min + l) + (d max (*min 

)(d max 

^min + l)/2). Indeed for Cf = 1, . . . , <2 m in we have c/ max — ^min + 1 summations to perform, whilst for 
+ 1, . . . , d max we have d max - c t + 1 summations to perform, which gives rise to 1 + 2 + ... + 
d max - dmin = (d max - d min )(d max u ram 

+ l)/2. 

Case 7r,i = 0. In this case, it is more convenient to compute p(s t \vi-T), using the recursion for 
"t 4 ' 1 ' 1 = Y^d t P( v t-d t +i:tWt)Pd t & s t t:'al which has complexity 0(TS(S + d max - d min + 1)), and then 
calculate 

P(*|»1:T) = £ fi^ct"*'" 1 ~ fi^O?*' 1 • 
r<i 

Most Likely Sequence of Hidden Variables 

If we define 5^' t ' 1 = max si . t _ 1 , c i 1 . t _ 1 p(si-,t-i, di:t-i, , i>i : t), then the most likely sequence of states, 
duration and count variables s*, T , d*. T , c*. T (under the assumptions = 1) can be obtained as 

for t = 1,.. . ,d min <5 t St ' dt ' 1 = Q^.d** 1 



for t = dmm + 1, . . . , d max 5 t st ' d *' 1 = 



• dt ' 1 if dt > t 

p(vt-dt+v.tWt)pdt raax St _ dt TT st s t _ dt max dt _ dt tf t t£' *~ dt ' ifd t <t 



fort = d ma x + l,...,T 



<V =P{vt-d t +i:t\a t )Pd t riiax 3t _ dt TT 3tat _ dt ma,x dt _ dt 5 t _ dt t 



max "I" X, J dt ^t-dt'^-it' 1 



[ST,d T ,C T ] 



{argmax ST dr <5y T ' T,:L , 1} if contraint ct = 1 
argmax ST dT Ct a^T' dT ' CT if no contraint ct = 1 



Sr-d*+c*:T-l — ST, d T -d* +c* :T-1 — dn CT-d» +c* :T-1 — d r , . . . , C T + 1 



{s t *,d t *, Ct *} = {^+ 1 ' t+1 ,l} 
t = T - dr + c* T - 1, while * > 1 sj_ d . +1:t _ 1 = s t *, d^.+^.i = d t *, <_ d . +1:t _ 1 = d t *, . . . , 2 

t = t-dt 
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Sampling a Sequence of Hidden Variables from p(<tut\vi:t) 

Such samples can be obtained recursively by considering the factorisation p(o\;t\v\:t) = p{ot\v\:t) Y^=\ p{?t\ ff t+i, vv.t) 
Suppose that, at time t, c t = 1 and we have sampled regime type s t and a duration d t . Then, 
s t -d t +i:t~i = s t , d t - dt +i:t-i = d t and c t -d t + i:t-\ = d t , . . . , 2, c t _ fc = 1, so that effectively we need 
to sample st-d t ,dt-d t from the distribution p{crl- dt \o t -d t +\, «i:t, jit-nrf), which is given by 



P(<?t-d t \o~t~d t + l,V 1: t + l) = 



P(P't-d t , ( ?t-d t + l,Vl:t) 
P((?t-d t + l,vi:t) 

s t-d t 

_ n s t s t _ dt O! t _ dt 

X/j T^s t i Xli a t-d t 

Note on Complexity. It is clear from the complexity analysis described above that whether to use an 
approach with joint or separate duration-count variables depends on the specific goals and assumptions. 
If we are interested in estimating the most likely sequence of hidden variables or sampling a path only, 
then if d max — dmin + 1 <?C dmax, using separate variables is more convenient. However, when interested 
in distributions such as, then one has to take into account the extra cost incurred in computing these 
distributions. 



6 Switching Linear Gaussian State-Space Models 

In Sections [4] and [5] we have analysed how to improve the basic HMM model by enabling dependence 
among observations and by explicit modeling of the regime-duration distribution. In this section, we intro- 
duce another extension of the HMM called the Switching Linear Gaussian State-Space Model (SLGSSM), 
which enables a more accurate modeling of noisy and continuous observations and the reconstruction of 
the hidden dynamics underlying a sequence of noisy observations Mesot and Barber (2007); Quinn et al. 
< 2008b; iZoeterl (120051. 



The SLGSSM is defined by the following linear equations 

h t = A st h t - 1 +v't\ n h t ~JV(»h fc ;0,Eg), hi 



>M(h 



i; A» 



where the continuous hidden variables hi-.r represents a hidden dynamics. The joint distribution of all 
variables factorizes as follows 



p(«i:T, h 1:T , s 1:T ) = p(«i|fti, s 1 )p(hi\s 1 )p(si) Y[p(vt\ht, s t )pQh\ht-i,st)p{st 



St-1 



Performing inference in the SLGSSM is intractable since, e.g., p(ht\st, Vi-.t) is a Gaussian mixture 
with S 1 ' -1 components. Several approximation schemes have been introduced over the years in order to 
deal with this intractability issue. For filtering, a successful method consists of employing a Gaussian 
collapsing procedure in a filtering routine on the line of the standard LGSSM predictor-corrector filtering 



routine. For smoothing, Barber ( 2006 1 showed how, by introducing some approximations, it is possible to 



use Gaussian collapsing in a Rauch-Tung-Striebel style smoothing routine, obtaining improved accuracy 
over other approximation schemes. 



7 Extension of SLGSSM with Explicit Regime-Duration Distribution 

As in the classical HMM, the regime-duration distribution of the SLGSSM is implicitly geometric. There- 
fore the SLGSSM shares the same limitations of the HMM in this respect. In this section, we show how 
apply methods analogous to the ones described in Section [5] for explicitly modeling the regime-duration 

distribution of the SLGSSM. 

We will use a Gaussian collapsing approximation method on the line of Barber ( 2006[) . Extensions 
of the SLGSSM for defining an explicit duration distribution have also been studied in |S. M. Oh and] 
Dellaert (20081. The authors introduce a set of duration-count variables c\-t as in Section 5.1.1 and 
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ct-l = 1 



c f = 1 




Figure 9: Extension of a sLGDS with duration-count variables. 



use standard approximate inference methods for the SLGSSM in which the discrete hidden variables is 
a merged variable {ct,st}. Sparsity in the transition distribution of the merged variables is then used 
to reduce computational cost. The advantage of maintaining separate variables is to get insight into the 
problem and properties of the approach. As we will see, this will enables us to derive exact inference for 
the special case in which independence among observations is imposed. 

7.1 Modeling State-Duration Distribution with One Set of Count-Duration Vari- 
ables 

In this section, we describe how define an explicit state-duration distribution for the SLGSSM by using 
one set of count-duration variables cu as in Section [5. 1| 

7.1.1 Decreasing Duration-Count Variables 

The first approach is by modeling the duration-count variables as in Section [5. obtaining the belief 
network representation in Fig. [9] (a). In the sequel, we will describe how to perform smoothing in 
this model. Unlike the models explained in the previous sections, this will be achieved with a sequential 
approach that first computes the filtered distributions p(ht, St\vi-.t) and then uses this estimate to compute 
p(ht, s*|wi:t)- This approach is preferable, since working in a log scale is not possible. 

Filtering 

An approach to filtering is to form the following recursion for p(ht\crt, Vi-.t) 

p(ht\cT t , Vl-.t) = p(h t \crt-l,St,fiC, Vl:t)p(&t-l\cTt,Vl:t) 

= 5 ct <ti max p(ftt|st-l = St, Ct-1 = C t + l,S t ,Vl:t)p(st-l = St, Ct-1 = <k + l|ct,Wl:t) 

+ ^ p(h t \s t -l, Ct-1 = l,St,Ul : t)p(st-l,Ct_l = l|<T t ,«l:t) ■ (2) 

st-1 

In order to understand the complexity of this recursion, suppose for simplicity that d m i n = 1. Then 
p(hi\ai,Vi) is a Gaussian and, from the recursiorj^] we see that p(h,2\(T2, Vi:2) is a mixture of Gaussian 
with S components. Therefore, in general, p(ht\o~t, vi-.t) is a mixture of Gaussians with 5 1 * -1 components 
as in the standard SLGSSM. In order to overcome, this exponential explosion of component with time, 
after estimating the recursion, we collapse the obtained mixture to a single Gaussian. 
Below we describe how to compute each required term in the recursion. 

17 We use the fact if p(ht-i\crt-\, «i:t— l) is a mixture of Gaussians, then p(ht\at— i, St, Vut) is a lso a mixture of Gaussians 
with the same number of components (see below). 
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Computing p(h t \a t -i, St, v\- t ). We have 

| x p(v t ,ht\(Tt-l,St,Vl:t-l) 

P{h t \CTt-l,St,V 1:t ) = : — 

P(v t \crt-i,St,v 1 ..t-i) 
_ p(vt[s*^r, s t , h t ,v iit ^r)p(h t \a t - 1 , s t , v 1:t -i) 
p{vt\a t -i,st,vv.t-i) 

p(Vt\s t ,h t ) j ht i p(h t -l:t\(Tt-l,St, Vl:t-l) 

p{vt\cr t -i,s t , Vl-.t-l) 
p(v t \s t ,h t ) f h p(h t \h t - 1 ,o^r, s t ,Vj rl t^r)p(ht-i\(Tt-i,^{, v 1:t -i) 

p(v t \(Tt-l,St,Vl:t-l) 

If we assume that p(ht-i\<Jt-i, «i:t-i) is a Gaussian distribution with mean fi t _\'' Tt ~ 1 and covariance 
P*_ 1 1,<Tt_1 , by using the linear system equations defining the evolution of the continuous hidden variables 
we find that p(ht\at-i, St, vi-.t-i) is Gaussian with mean and covariance given by 

f t— l,<r t _i,« t _ ;i \ _ <s t Ji-l,^t-i 

<H — \' l */p(ht|CT t _i,s t ,u 1:t _i) — n t-l 

^t -{(nt-n t ){h t -n t ) ) P (h t \<y t _ l , st ,v l ., t _ l ) 

By using the linear system equations defining the observation-generation process we find 

In, \ — f> s tt t ~ 1 '"t-l, s t 

\ t, t/p(«t|o-t_l,«ti»l:t-l) — t 

{(vt - {v t ))(v t - {vt)) T ) pM a t _ 1:St , Vl:t _ l) = 5 s * i^' -1 '"' -1 ' 8 ' (B st ) T + E- 

By using the formula of Gaussian conditioning, we find that p(ht\cr t -i, St, v\,t) is Gaussian with mean 
and covariance given by 

ft,**-!,.* = ft-i,<r t -i,. t + K{?}t _ B ^jt-^t-i,s t) 

pti<Tt — l> s t pt— l)Ct-l >S( _ j£ps t pt—l,S t -l,St,C t — l 

where if = p* -1 '"*- 1 ' 8 * ^st^pstp* -1 '""*-!' 8 * ^ s t-jT + 
Computing p(cr t _i|cr t , We have 

p(Wt,<Tt|«l:t-l) 



p{<Jt\vi:t) = 



P(«t|vi:t-l) 
OC ^ p(Vt,a t -l:t\vi:t-l) 

= ^ p(w t |fT t _l, S t , ■yi :t -l)p(c t |c t _l)p(s t |lT t _l)p(crt_i|wi :t _i) 

= 5 Ct <d max p(«t|Si-l = S t ,C t -l =C t + 1, Si, Vl:t-l)p(s t -l = S t ,C t -l = C 4 + l|wi :t _l) 

+ y^Ppt|st-i,c t -i = l,s t ,wi :t _i)p ct 7r st3t _ 1 p(s t _i,c t _i = l|vi:t_l), 
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where p(Pt|g t -i,*t,Pi,t-i)= , t _\ = , e~2^-K ' m ' > (<*-** ' ' ). There- 

\/dctP t ' * 

fore 

/ I \ P(crt-l:t\vi:t) 

p{a t -i \at,vi :t ) = -. — j 

p(«t,ft-l:t|«l:t-l) 



p(«t|«l:t-l)p((Tt|Wl:t) 

p(vt\(Tt-l, 8t,Vl:t-l)p(Ct\ct-l)p(s t \(Tt-l)p((7t-l\vi:t-l) 
p(vt\Vut-l)p(o't\Vl:t) 



(3) 



Gaussian Collapsing. We can collapse the mixture of S Gaussians p(ht\crt, vi-.t) to a Gaussian by 
moment matching 

h\ at = 5 ct <d max p(s t -i = s t ,Ct-i =ct + l\a t ,v 1:t )h t t ' at ' Ct+1 ' at 

+ ^ p(s t -l,Ct-l = l|<Tt,«l:t)ftJ'** -1 ' '** 

F*' CTt =p( St _! =fl t ,Ce-i =c t + l|cTt,Wi:t)(P t Mt - Ct+Mt +^> C *+ 1 ' S «(^' S '. C '+ 1 > S «) T ) 

+ 2^P( S t-l! C *-l = l|<?t.Wl:t)0Pt (ft t ) ) - h t t {h t ' *) . 

The computational cost of this filtering recursion is 0(TS' 2 (d max — d m in + 1))- However, notice that 
if we plug Eq. [3] into Eq. [2] we obtain 

p(h t \at, Vi-.t) 

= <5c t <d max p(ftt|st-l = St, Ct-l = Ct + l,St,«l:t)p(s*-l = S t ,C t _l = C t + l|cXt,Vl:t) 

H 7 j 1 S ^ p(ht\s t -l,C t -l = l,S t ,Vl:t)p(vt\St-l,C t -l = l,St,V 1 :t-l)-K stSt _ 1 p(at-l\v 1:t -l) 

p{Vt\Vl:t-l)p{(Tt\Vl:t) ^ 

s t — l 

Therefore one sum only over st-i for all Ct is required and the cost reduces to 0(T(S(d max — d m i n + l) + S')). 

Cut of dependence when changing regime. Notice that in this model, we can introduce cut 
of dependence when changing regime by adding a link from ct-i to h t as in Fig. [9](b), and therefore in 
principle without the need to use a modeling for c\ : t as in Section [5.1.2| In this case 



p(h t \h t -i, s t , c t -i) 



AA(ft t;A t 3t ,E st ) ifct-i=l 
M{h t ;A at h t -^^) ifct_i>l 



Therefore, the filtering recursion becomes 

p(ht\(Tt,Vl:t) = ^ p(h t \cTt-l,St,fiC Vl:t)p(<Tt-l\(Tt,Vl:t) 
"1,-1 

= Sc t <d n , ax P(ht\st-l = S t ,Ct-l =Ct + l,S t ,Vl:t)p(st-l = St,Ct-l =Ct + lWt,Vl: t ) 

+ ^ p(h t [s^T, Ct-l = 1, S t ,JUitt=rr, Vt)p(s t -l,C t -i = l|(Tt,«l : t) 
st-1 

= <5 ct <d max p(ftt|s t -i = S t ,C t -i = c t + l,s t ,vi:t)p(st-i = St, Ct_l = C t + l\<Jt,V\-t) 

H 1—\ T~\ >^ P(S*-1, Ct-l = 1 Wt, Vut) 

p(Vt\St,Ct-l = 1) 

s £ — 1 

p(v t \s t , h t ) J ht i JV{h t ; A at h t -i, £^)p(/lt-i|ot-i,X> «i=t-i) 

= <$c t <d max ^ | r 

p(«t|ft-l,St,Wl:t-l) 

p(S(_l = St, Ct-l = C t + l|(7t, Wl:t) 

p{v t \ht,st)N{ht;y a \Y> at ) , . . 

h r\ p(ct-i = i\a t ,vi:t). 

P(Vt\St, Ct-l = 1) 
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From this recursion we see that p{h t \(Jt, vi-.t) is a mixture of Gaussian with t components (and p(ht\vi-.t) 
is a mixture of Gaussian with Sd max t components). It is therefore feasible to perform inference without 
the collapsing approximation. However, as we will see below, it is computationally more convenient to 
use a modeling for cu as in Section 5.1.2 for which p(ht\vi-.t) is a mixture of Gaussian with Seimax 
components. 

Smoothing 

For smoothing we use the following recursion 

p(ht\<Jt,V±:T) = ^ p(ht\<Tf.t+l,Vl:T)p(crt+l\<7t, Vi-.t) 
CT t+l 

= <5c t =l ^2 p(ht\<Tt:t+l,Vl;T)p(crt+l\<Tt,Vl:T) 



+ 8 ct> ip(h t \a t , St+1 = S t ,C t+ l =c t — l,vi ]T )p(st+i = S t ,C t+ l =Ct — l\at,vi ]T ) ■ 

Notice that in general htJ^Lct+i \ o t , s±+i , vi-t since, for example, the path ct+i, s<+2, «t+2, ht+2, ht+i, ht 
is not blocked. However for c t > 1, p(ht\fft, st+i = s t ,c t +i = c t — 1,vi-.t) = p(ht\crt, «i : t). 

Computing p(hi\a t :t+i, Vi-.t)- 

p(h t \at:t + l,Vl:T) = / p(h t \ht+l,(Tt,St+l,£i^,Vl:t,Vt44rr)p(ht+l\cTt:t+lVl,T) 

J h t+l 

p(h t \h t +i,a t , St+i,v 1: t)p(ht+i\at+i,vi : T)- (4) 



By using the linear system equations defining the evolution of the continuous hidden variables we find 

((ht+i — hl+'i' )(h t — hi' 1 ' ) ) p (h t+1 \s t =i,s t+1 =j,ot=l,v 1 -.t) = A 3 : p t M ' i . 

Thus the joint covariance of p(ht-.t+i\st;t+i, <k, Vi-t) is given by 

p t,i,l p^- l (A J ) T 
Aip*/' 1 A J P t '' l ' ! (A J ) T + E J f 

By using the formula of Gaussian conditioning we find the p(ht\h t +i, at, St+i,i>i:t) is Gaussian with mean 
and covariance given by 



where i^' St+1 = p^^ (A 3t + 1 ) T (A 3 ^ 1 P^^ (A at + 1 ) T + E^ 1 )" 1 . We can now define 

h t = A t h t +i +m t +Vt i 



where m"*'^ 1 = - A^' St+1 A s *+^ h^' and = A/"(0, P t t,CTt — A^ ' St+1 A a * +1 p t t,CTt ) which gives 

a Gaussian distribution for p(h t \at:t+i, vi-.t) with mean and covariance 

/if' CTt:t+1 =ir' St+1 ^i f+1 +m t t ' St+1 

pT,tr t:t + 1 _ J^ t ,s t + i pT,cr t + i ^j^r t ,s t+1 y _|_ pt,<T t _ j»i' ! i+ipt,»,^i t+1 

Therefore p(/i t |<7t, is a mixture of Gaussians with mixture components p(<r t +i|cr t , v\-t)- 
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Computing p(ot+i \<Jt, v i-.T)- We first observe that p(a t \vi : T) can be computed recursively as p(<r t |vi : t) 
E„ t+1 p(&t\<7t+i,vi:T)p(crt+i\vi:T) where 

p(cT t \cTt+l, Vi-.t) = / p(0 , t|ft*+l,0 , t+l,Wl:t,tJt+trf)p(/lt+l|0't+l|«l:T). 

The above integral, the average of the function p(a t \h t+ i, a t +i,Vi :t ) wrt p(h t +i\at+i,v 1: T), cannot be 
estimated in closed form. On the line of (Barber 20061, we thus approximate it as follows 

(p(<Tt\ht+i,fft+l,V 1 :t)} p Q H+1 \ at+1 , Vl . T ) ~ p(a t \h t+1 ,a t+1 ,v 1 . t )\ _-T,a t+1 . (5) 

The term p(at+i\crt, «1:t) can then be easily derived. 
Gaussian Collapsing 

Since collapsing is required only for Ct = 1, the computational cost of the smoothing recursion is given 
by 0(TSd 

max ) • 

Cut of dependence when changing regime. 

p(ht\at,Vi: T ) = <5 c «=i ^ p(ht\ff t ,sihtrC,vi,t,iit^Kf) 2^ P(f*+ikt) «1:t) 

st+l c t+1 
+ 5c t >lp(ft*|0't, St+1 = St, Ct+1 = Ct — 1, U1:t) 
= <5e t = ip(ftt|«7t,«l:t) 

+ 5 Ct >l / p(ftt|ftt+lj°'t)St+l = StlCt4J r ^=-er^^,Wl:t,Wi+^)p(^t+lkt:t+l J 'yi:T) 

From this recursion, it would look like approximations as Q and Q are required. As we will see below, 
when using a modeling for ci-.t as in Section [5.1.2| these will become exact and thus inference can be 
computed exactly. 

7.1.2 Increasing Count-Duration Variables 

Using a modeling for ci-.t as in Section [5.1.2| is beneficial over Section [7,1,1| when cutting dependence 
across regimes, in which case filtering becomes computationally less expensive and the smoothing ap- 
proximations (|4| and Q become exact. 

Filtering 

We can introduce cut of dependence when changing regime by adding a link from c t to h t as in Fig. [9] 
(c). The filtering recursion becomes 

p(ht\a u vut) = 5ct=i > 

p{Vt\St,Ct) 



+ 8 ct >ip{ht\st-i = S t ,C t -l = c t — l,a t ,vi:t)p(s t -i = S t ,C t -l = c t - l\a t ,Vl:t) 
p(vt\h t ,s t )JV(ht;n st ,Z st ) 



— 5 ct =i 
+ 8 ct >i 



p(v t \st,c t ) 

p(v t \ht,s t ) J h N{h t ; A s ' ht-i,Z ! £)p(ht-i\st-i = s t , c t -i = c t - Vi :t -i) 

p(v t \st,c t ) 
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From this recursion we see that p(ht\<Tt,Vi : t) is a Gaussian (and therefore p(ht\v\-t) is a mixture of 
Gaussian with Sdm^ components, as opposed to TSd ma _ x components of Section 7.1.11. 

Notice that a standard change-point model without explicit duration distribution can be obtained by 
setting ct = {2, 1} with 



p(h t \h t -i, s t , c t ) = 



(Af(h t ;p 3t ,Z st ) ifct = l 
\M{h t ;A st ht-i^) ifc t = 2 



p(c t \st-i,s t ) = 



1 if s t -i / s t 

2 if s t -i = s t 



Notice that, in this case, p(ht\s t ,ct = l,Vi:t) is Gaussian, whilst p(h t \s t ,ct = 2, vi-.t) is a mixture of 
Gaussian with t — 1 components (the value Ct does not give information on the value of c t ~i as in the 
previous case), and therefore p(ht\,Vi-t) is a mixture of Gaussian with St components. 

Smoothing 

If we use a modeling for as in Section [5.1.2| we obtain 

p(h t \cT t , V 1:T ) = p(h t \(Tf.t+l,Vl:T)p((T t +l\(Tt, V1-.t) 

= 5 Cf >d min p{ht\(Tt,£**r, Ct+l = l,Vl:t,«t+*ff)p( s t+l> c t+l = l\vt,V 1[T ) 



+ f>c t <d m ^p{ht\crt, st+i = s t , ct+i =Ct + l,v 1:T )p{s t +i = s t , ct+i = c t + l\at,Vi:r) 
= p(ht\a t ,v 1:t )p(ct +1 = l\a t ,v 1:T ) 



+ 6 ct 



<d„ 



p(ht\h t +i,(Jt,St+i = st, Ct 



h t 4 



p(h t+ i\a t , s t+ i = s t ,c t+1 =Ct + l,v 1]T )p(s t+ i = s t ,c t+1 = Ct + l\a t ,v 1]T ) 

Notice that, since Ct+i = d > l,s t +i = implies s t = fe, c t = d — 1, we have p(h t +i\s t = fc, c t = 
d — 1, st+i = fc, c t +i = d, vi-.t) = p(/lt+i|st+i = k, ct+i = d, «i : t), and therefore the approximation in Eq. 
2] becomes exact. 

Notice that, since we established that p(/i t |er t , vi :t ) is Gaussian, at time £ p(ftt|cr t , i>i : t) is a mixture 
of T — t + 1 Gaussians. 

Computing p(<7t+i \(Jt, «i : t)- We first observe that p(crt|vi : T) can be computed recursively as p(<r t |vi : t) 
E„ t+1 p(o"t|crt + i,«i:T)p(o~t+i|i'i : T) where 

p(cTt|«l : r) = ^ p(<7t\(Tt+l,Vl:T)p{<Tt+l\Vl:T) 



^ p(o"iLs*-KT,; 



,Vi-.t,Vt+trf)p(st+l,Ct+l = 1|«1:t) 



+ p((7 f |s t + l = S t ,C t + l = C t + l,V 1:T )p(crt + l\v 1 :T) 
p(crt\v 1:t )p(c t +i = 1|«1:t) + p(&t+l\Vl:T) 

The above average (p(crt|fet+i, (Tt+i, Wi:t)) p (h t+1 | £ r t+1 ,u 1 . T ) cannot be estimated in closed form. We there- 
fore approximate it by evaluation at the mean 



(j)((7 t \ht+l, fft+l, Vl:t))p(ht +1 \a t+u v ltT ) ~ p{<7t\ht+l,<Tt+l,Vl;t)\ 



on the line of (Barber 20061. The term p(<Tt+i\(Tt, Wi:t) can then be easily derived. 
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8 Conclusions 

In this paper, we gave a unified and simple overview of many probabilistic models used for modeling 
time-series data containing different underlying dynamical regimes. This common framework enabled us 
to make connections among models that were not observed in the past, and also to naturally introduce 
a variety of new models and inference routines. 

The paper is therefore to serve as a tutorial for researchers wishing to gain an introduction into the 
subject, with the advantage that the material is presented under a common and consistent modelling 
framework. It is also intended to be of benefit to researchers working in the area, both by offering a dif- 
ferent viewpoint to previous classical review papers and also by introducing several important extensions 
to current models. 
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